TITLE OF THE INVENTION
WIDE CONNECTIONS FOR TRANSFERRING DATA BETWEEN PE'S OF AN
N-DIMENSIONAL MESH-CONNECTED SIMD ARRAY WHILE
TRANSFERRING OPERANDS FROM MEMORY

CROSS REFERENCE TO RELATED APPLICATIONS 
This application claims priority of U.S. Provisional
Patent Application No. 60/161,587, filed October 26, 1999,
entitled FINITE DIFFERENCE ACCELERATOR.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR
DEVELOPMENT
N/A 

BACKGROUND OF THE INVENTION 
The present invention relates generally to SIMD array
processors, and more specifically to SIMD array processors
having improved data transfer efficiency between
processing elements incorporated therein.

Single-Instruction Multiple-Data (SIMD) array
processors are known which comprise multi-dimensional
arrays of interconnected processing elements executing
the same instruction simultaneously on a plurality of
different data samples. For example, an SIMD array
processor may include a two-dimensional array of
processing elements in which each processing element is
connected to its four (4) nearest neighboring processing
elements to form a "North, East, West, South (NEWS)
array". In such NEWS arrays, each processing element can
communicate directly with its North, East, West, and
South neighbors.

One aspect of the typical SIMD array processor that
limits the rate at which the processing elements can
communicate with each other is that one or more of the
neighboring processing elements with which a particular
processing element communicates may be physically located
on a different Application Specific Integrated Circuit
(ASIC) and/or on a different Printed Circuit Board (PCB).
For example, when a processing element directly accesses
a multi-bit data sample from a neighboring processing
element physically located on a different ASIC or a
different PCB, a significant amount of time may be
required for that data sample to propagate between the
ASIC's or PCB's. To account for this propagation time,
communication registers used in processing elements of
the typical SIMD array processor are generally clocked at
relatively low speeds. However, clocking communication
registers at such low speeds may cause many operating
cycles of a processing element to be wasted while the
processing element waits for the data transfer to
complete. As a result, the typical SIMD array processor
may not be suitable for some high-speed data processing
applications.

It would therefore be desirable to have an SIMD
array processor that has improved data transfer
efficiency between processing elements incorporated
therein. Such an SIMD array processor would transfer
data samples more efficiently whether or not neighboring
processing elements are located on the same ASIC or PCB.

BRIEF SUMMARY OF THE INVENTION
In accordance with the present invention, an SIMD
array processor is provided in which data transfer
efficiency is enhanced between processing elements
included therein. The SIMD array processor includes a
plurality of mesh-connected processing elements
configured in a multi-dimensional array. Each processing
element includes at least one "narrow" memory buffer, at
least one "wide" data register, and at least one "wide"
communication register. The narrow memory buffer is
adapted to transfer data serially between a memory and
the wide data register; and, the wide data register is
adapted to transfer data directly to the wide
communication register. Further, the wide communication
register is adapted to transfer data directly to the
communication register of a neighboring processing
element while the memory buffer accesses data from
memory. In a preferred embodiment, the memory buffer has
a width of one (1) bit to allow bit-serial data transfer
between the memory and the wide data register while the
wide communication register transfers data in parallel to
(from) the communication register of the neighboring
processing element.

Other features, functions, and aspects of the
invention will be evident from the Detailed Description
of the Invention that follows.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING 
The invention will be more fully understood with
reference to the following Detailed Description of the
Invention in conjunction with the drawings, of which:

Fig. 1 is a block diagram depicting an SIMD array
processor according to the present invention;

Fig. 2 is a block diagram depicting a processing
element included in the SIMD array processor of Fig. 1;
and

Fig. 3 is a table depicting the contents of
respective YS, NEWSO, and NEWSI registers of two (2)
neighboring processing elements included in the SIMD
array processor of Fig. 1, in an illustrative example of
a data transfer between the two (2) processing elements.



DETAILED DESCRIPTION OF THE INVENTION 
The entire disclosure of U.S. Provisional Patent
Application No. 60/161,587, filed October 26, 1999, is
incorporated herein by reference.

Fig. 1 is a block diagram depicting an illustrative
embodiment of an SIMD array processor 100 in accordance
with the present invention. The SIMD array processor 100
includes a plurality of identical Processing Elements
(PE's) 104 through 134 interconnected in a mesh and
configured as a two-dimensional NEWS array. The PE's 104
through 134 are communicably coupled to respective
bi-directional Input/Output's (I/O's) 0 through 15 of a
Random Access Memory (RAM) 102. In a preferred
embodiment, the RAM 102 is a Synchronous Dynamic RAM
(SDRAM).

Although Fig. 1 depicts the SIMD array processor 100
as including the two-dimensional NEWS array of PE's 104
through 134, it should be understood that the SIMD array
processor 100 may comprise an array of PE's having one
(1) or more dimensions. It should also be understood
that the size of the array may be adjusted. In a
preferred embodiment, the NEWS array of PE's is
implemented on an ASIC comprising an 8 x 8 NEWS array.
Fig. 1 depicts the SIMD array processor 100 as including
a 4 x 4 NEWS array for clarity of discussion.

In a preferred embodiment, the PE's 104 through 134
are physically located on the same ASIC to form the NEWS
array, thereby simplifying I/O connections between
nearest neighboring PE's in the NEWS array. For example,
relative to the PE 114, a "North" I/O 140 interconnects
the PE 114 and the PE 112; an "East" I/O 144
interconnects the PE 114 and the PE 122; a "West" I/O 146
interconnects the PE 114 and the PE 106; and, a "South"
I/O 142 interconnects the PE 114 and the PE 116.
Further, the PE's 104, 106, 108, 110, 112, 118, 120, 126,
128, 130, 132, and 134 that are physically located along
the edges of the NEWS array comprise suitable North,
East, West, and South I/O's for connecting these PE's
with PE's that are physically located on different
ASIC's. For example, a 2 x 2 array of ASIC's may be
implemented on a PCB; and, North, East, West, and South
I/O's of PE's physically located along the edges of NEWS
arrays of adjacent ASIC's may be suitably interconnected.
Still further, respective 2 x 2 arrays of ASIC's may be
implemented on different PCB's; and, the North, East,
West, and South I/O's of PE's physically located along
the edges of the respective arrays of ASIC's may be
suitably interconnected. In this way, the size of the
NEWS array comprising the plurality of identical PE's can
be increased to satisfy the processing requirements of
the target application.

In this illustrative embodiment, each of the PE's
104 through 134 is connected to one and only one of the
I/O's 0 through 15 of the RAM 102. For example, the PE's
104, 106, 108, and 110 are connected to the I/O's 15, 14,
13, and 12, respectively; the PE's 112, 114, 116, and 118
are connected to the I/O's 11, 10, 9, and 8,
respectively; the PE's 120, 122, 124, and 126 are
connected to the I/O's 7, 6, 5, and 4, respectively; and,
the PE's 128, 130, 132, and 134 are connected to the
I/O's 3, 2, 1, and 0, respectively, of the RAM 102. In a
preferred embodiment, the PE's 104 through 134 utilize
the I/O's 0 through 15 of the RAM 102 to access data in
the RAM 102 in a bit-serial fashion.
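
The neighbor and I/O assignments recited above for the 4 x 4
example of Fig. 1 can be sketched in a few lines of Python.
This is an illustrative sketch only, not part of the
disclosure: the function names pe_position, pe_at,
news_neighbors, and ram_io are hypothetical, and the layout
(reference numerals running down each column) is inferred
from the recited neighbors of the PE 114 and the recited
I/O assignments.

```python
ROWS, COLS = 4, 4          # 4 x 4 NEWS array of Fig. 1
FIRST_PE, STEP = 104, 2    # PE reference numerals 104, 106, ..., 134


def pe_position(pe):
    """Return (row, col) of a PE; numerals run down each column (inferred)."""
    index = (pe - FIRST_PE) // STEP
    return index % ROWS, index // ROWS


def pe_at(row, col):
    """Return the PE numeral at (row, col), or None if it lies off this ASIC."""
    if 0 <= row < ROWS and 0 <= col < COLS:
        return FIRST_PE + STEP * (col * ROWS + row)
    return None


def news_neighbors(pe):
    """Return the on-ASIC North, East, West, and South neighbors of a PE."""
    row, col = pe_position(pe)
    return {
        "N": pe_at(row - 1, col),
        "E": pe_at(row, col + 1),
        "W": pe_at(row, col - 1),
        "S": pe_at(row + 1, col),
    }


def ram_io(pe):
    """RAM 102 I/O line for a PE: PE 104 -> I/O 15, ..., PE 134 -> I/O 0."""
    return 15 - (pe - FIRST_PE) // STEP
```

Under these assumptions, news_neighbors(114) yields the PE's
112, 122, 106, and 116, and a PE with a None entry is an edge
PE whose I/O in that direction would connect to a PE on a
different ASIC.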

Those of ordinary skill in this art will appreciate
that the SIMD array processor 100 of Fig. 1 may be
incorporated in a multi-dimensional processing system
including, e.g., a command preprocessor interfaced with a
processor controller that provides intermediary
processing functions between the command preprocessor and
the SIMD array processor 100.

Fig. 2 is a block diagram depicting an illustrative
embodiment of a PE 200 in accordance with the present
invention. In a preferred embodiment, the PE 200 is
representative of each of the PE's 104 through 134
included in the SIMD array processor 100. Accordingly,
the SIMD array processor 100 preferably includes a
plurality of identical PE's such as the representative PE
200 interconnected in the 4 x 4 NEWS array configuration.
In the illustrated embodiment, the PE 200 includes a
multi-bit data register YS 204. In a preferred
embodiment, the data register YS 204 is 64 bits wide. It
should be appreciated that the data register YS 204 can
be used to store, e.g., a floating-point number or a
signed/unsigned fixed-point integer. It should also be
appreciated that the data register YS 204 can be used as
a shift register. For example, the data register YS 204
may be used to shift in binary values in a bit-serial
fashion from a memory buffer M 202 by way of a line 215.
In a preferred embodiment, the memory buffer M 202 is a
1-bit wide memory buffer. Moreover, the data register YS
204 may be used to shift left binary values contained
therein.

As described above, the PE's 104 through 134 (see
Fig. 1) are preferably physically located on an ASIC to
form a NEWS array to simplify I/O interconnections
between the nearest neighboring PE's in the NEWS array.
In a preferred embodiment, each of the PE's 104 through
134 reads four (4) bits of data directly (in parallel)
from a nearest neighboring PE connected to its North,
East, West, or South I/O while writing four (4) bits of
data in parallel to an opposite neighboring PE. For
example, the PE 114 may read four (4) bits of data in
parallel from the PE 112 via the North I/O 140 while
writing four (4) bits of data in parallel to the opposite
neighbor PE 116 via the South I/O 142, and vice versa
(see Fig. 1); and, the PE 114 may read four (4) bits of
data in parallel from the PE 106 via the West I/O 146
while writing four (4) bits of data in parallel to the
opposite neighbor PE 122 via the East I/O 144, and vice
versa (see Fig. 1).

In the illustrated embodiment, the PE 200 (see Fig.
2) includes communication registers NEWS Input (NEWSI)
206 and NEWS Output (NEWSO) 208. In a preferred
embodiment, each of the communication registers NEWSI 206
and NEWSO 208 is 4 bits wide for reading (writing) four
(4) bits of data in parallel from (to) a nearest
neighboring PE connected to its North, East, West, or
South I/O. Specifically, the communication register
NEWSI 206 receives four (4) data bits in parallel from
one of the nearest neighboring North, East, West, and
South PE's and preferably loads the four (4) data bits in
parallel into the four (4) Least Significant Bit (LSB)
positions (i.e., bit positions 0, 1, 2, and 3) of the
data register YS 204 by way of a bus 218. Further, the
data register YS 204 preferably loads four (4) data bits
in parallel into the communication register NEWSO 208 by
way of a bus 214 from its four (4) LSB positions for
subsequent provision in parallel to one of the nearest
neighboring North, East, West, and South PE's. The PE
200 includes a multiplexor (MUX) 210 that selects one of
the 4-bit wide data buses coupled to respective
communication registers NEWSO (not shown) of the nearest
neighboring North, East, West, or South PE. The PE 200
also includes circuitry that enables the output of the
communication register NEWSO 208 to drive only the bus
that is connected to the nearest PE in the opposite
direction. It is noted that the SIMD array processor 100
preferably includes a sequencer (not shown) for, e.g.,
decoding commands provided by a processor controller to
obtain a stream of instructions and broadcasting the
instruction stream to the NEWS array of PE's 104 through
134. Such a sequencer includes a control register for
providing a 2-bit control word R that controls the
selection of one of the above-mentioned four (4) data
buses by the MUX 210 and the bus that the communication
register NEWSO 208 drives with its output.
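
The role of the control word R described above can be
sketched as follows. The text does not give the bit encoding
of R, so the READ_FROM table below is purely an assumed
encoding chosen for illustration, and the name select_buses
is hypothetical. The sketch shows only the stated behavior:
one 2-bit value picks the bus that the MUX 210 reads, and
the NEWSO 208 output drives the bus in the opposite
direction.

```python
# Hypothetical encoding of the 2-bit control word R (not given in the text).
READ_FROM = {0b00: "N", 0b01: "E", 0b10: "S", 0b11: "W"}
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}


def select_buses(r):
    """Return (bus read through MUX 210, bus driven by NEWSO 208) for R."""
    source = READ_FROM[r & 0b11]
    return source, OPPOSITE[source]
```

Because the same R is broadcast to every PE, all PE's read
from the same direction and write to the opposite one, so
data moves uniformly across the array in a single step.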

The PE 200 includes the memory buffer M 202, which
preferably comprises a 1-bit wide memory buffer through
which all data are transferred between the PE 200 and the
RAM 102. In a preferred embodiment, the memory buffer M
202 included in each of the PE's 104 through 134 is
connected to one and only one of the I/O's 0 through 15
of the RAM 102 for accessing a single bit of data during
each memory cycle of the RAM 102. For example, a single
data bit may be read from one of the I/O's by way of the
memory buffer M 202 and provided to the data register YS
204.

As explained above, the PE's 104 through 134 (see
Fig. 1) are preferably physically located on the same
ASIC to form a NEWS array, and arrays of such ASIC's may
be implemented on respective PCB's with North, East,
West, and South I/O's of nearest neighboring PE's
suitably interconnected. As a result, one or more of the
nearest neighboring PE's with which a particular PE
communicates may be physically located on a different
ASIC and/or on a different PCB.

Those of ordinary skill in this art will appreciate
that when a PE directly accesses a multi-bit data sample
from a neighboring PE physically located on a different
ASIC or a different PCB, a significant amount of time may
be required for that data sample to propagate between the
ASIC's or PCB's. In a preferred embodiment, the SIMD
array processor 100 includes circuitry for generating a
clock signal for latching four (4) bits of data in the
respective communication registers NEWSI 206 and NEWSO
208; and, each ASIC preferably includes circuitry for
generating respective clock signals used by the PE's 104
through 134 and the RAM 102 physically located thereon.
The clock speed of the communication registers NEWSI 206
and NEWSO 208 is preferably one-fourth of the memory
clock speed; and, the memory clock speed is preferably
one-half of the PE clock speed. This means that the
communication registers NEWSI 206 and NEWSO 208 can latch
their respective inputs once every eight (8) PE clock
cycles. As a result, the communication register NEWSO
208 holds its latched data for nearly eight (8) PE clock
cycles, and the communication register NEWSI 206 reads
four (4) bits of data from the communication register
NEWSO of a neighboring PE (which may be physically
located on a different ASIC or a different PCB) only
after the data has settled for several consecutive PE
clock cycles, thereby allowing sufficient time for the
data to propagate.
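
The clock relationships stated above reduce to simple
arithmetic, sketched below. The constant and function names
are hypothetical; the ratios and the 4-bit width are those of
the preferred embodiment. The second function makes explicit
a consequence of those figures: a 4-bit link latched once
every four memory cycles carries, on average, one bit per
memory cycle, which is the same rate at which the 1-bit
memory buffer delivers operand bits.

```python
PE_CLOCKS_PER_MEMORY_CLOCK = 2      # memory clock = 1/2 of the PE clock
MEMORY_CLOCKS_PER_COMM_CLOCK = 4    # communication clock = 1/4 of the memory clock


def pe_clocks_per_comm_latch():
    """PE clock cycles between successive latches of NEWSI/NEWSO."""
    return PE_CLOCKS_PER_MEMORY_CLOCK * MEMORY_CLOCKS_PER_COMM_CLOCK


def link_bits_per_memory_cycle(comm_width=4):
    """Average neighbor-link throughput in bits per memory cycle."""
    return comm_width / MEMORY_CLOCKS_PER_COMM_CLOCK
```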

The embodiments disclosed herein will be better
understood with reference to the following illustrative
example. In the following example, it is understood that
the operating speed of the PE is twice that of the memory
to allow the PE to perform two (2) sequential sets of
operations on registers included therein while the memory
either provides a single bit to the memory buffer M or
receives a single bit from the memory buffer M.

This example comprises the transfer of a 12-bit
operand "A" (i.e., A11A10A9A8A7A6A5A4A3A2A1A0 =
"101001010011") from memory, to a memory buffer M0, to a
data register YS0, and then to a communication register
NEWSO0 of a first PE; and, the transfer of this operand
from the communication register NEWSO0 to a communication
register NEWSI1 and a data register YS1 of a second PE.
In this example, it is understood that the first and
second PE's are nearest neighboring PE's that may or may
not be physically located on the same ASIC or the same
PCB. It is further understood that the data registers
YS0 and YS1 are 12-bit registers, the communication
registers NEWSO0 and NEWSI1 are 4-bit registers, and the
memory buffer M0 of the first PE is a 1-bit register.
Further, at least the first PE includes circuitry
configured for addressing a bit-serial column of memory.
For example, binary values representative of the operand
"A" may be stored in contiguous bit locations of the
bit-serial column of memory. Moreover, the clock speed of
the first and second PE's is understood to be twice that
of the memory.

In this example, Fig. 3 depicts the contents of the
YS0, NEWSO0, NEWSI1, and YS1 registers and the contents
of the memory buffer M0 at the end of each memory cycle.
During the first memory cycle, the memory provides the
bit "A11 = 1" (i.e., the Most Significant Bit (MSB) of the
operand "A") to the memory buffer M0. Because the clock
speed of the first PE is twice that of the memory, the
bit "A11 = 1" is transferred from the memory to the memory
buffer M0 during the first half of the memory clock
cycle, and the bit "A11 = 1" is transferred from the
memory buffer M0 to the LSB position of the data register
YS0 during the second half of the memory clock cycle.
Fig. 3 therefore depicts the contents of the memory
buffer M0 as "1", and the respective contents of the YS0,
NEWSO0, NEWSI1, and YS1 registers as "XXXXXXXXXXX1",
"XXXX", "XXXX", "XXXXXXXXXXXX" ("X" = "don't care") at
the end of the first memory cycle. During the second
memory cycle, the bit "A10 = 0" is transferred from the
memory to the memory buffer M0 during the first half of
the memory clock cycle, and the first PE shifts the
contents of the data register YS0 left by one bit
position and loads the vacated bit position with the
contents of the memory buffer M0 during the second half
of the memory clock cycle. Fig. 3 therefore depicts the
contents of the memory buffer M0 as "0", and the
respective contents of the YS0, NEWSO0, NEWSI1, and YS1
registers as "XXXXXXXXXX10", "XXXX", "XXXX",
"XXXXXXXXXXXX" at the end of the second memory cycle.
During the third memory cycle, the bit "A9 = 1" is
transferred from the memory to the memory buffer M0
during the first half of the memory clock cycle, and the
first PE shifts the contents of the data register YS0
left by one bit position and loads the vacated bit
position with the contents of the memory buffer M0 during
the second half of the memory clock cycle. Fig. 3
therefore depicts the contents of the memory buffer M0 as
"1", and the respective contents of the YS0, NEWSO0,
NEWSI1, and YS1 registers as "XXXXXXXXX101", "XXXX",
"XXXX", "XXXXXXXXXXXX" at the end of the third memory
cycle.

During the fourth memory cycle, the bit "A8 = 0" is
transferred from the memory to the memory buffer M0
during the first half of the memory clock cycle, and the
first PE shifts the contents of the data register YS0
left by one bit position and loads the vacated bit
position with the contents of the memory buffer M0 during
the second half of the memory clock cycle. Fig. 3
therefore depicts the contents of the memory buffer M0 as
"0", and the respective contents of the YS0, NEWSO0,
NEWSI1, and YS1 registers as "XXXXXXXX1010", "XXXX",
"XXXX", "XXXXXXXXXXXX" at the end of the fourth memory
cycle. During the fifth memory cycle, the bit "A7 = 0" is
transferred from the memory to the memory buffer M0, and
the data stored in the four (4) LSB positions of the data
register YS0 are loaded into the communication register
NEWSO0 during the first half of the memory clock cycle;
and, the first PE shifts the contents of the data
register YS0 left by one bit position and loads the
vacated bit position with the contents of the memory
buffer M0 during the second half of the memory clock
cycle. Fig. 3 therefore depicts the contents of the
memory buffer M0 as "0", and the respective contents of
the YS0, NEWSO0, NEWSI1, and YS1 registers as
"XXXXXXX10100", "1010", "XXXX", "XXXXXXXXXXXX" at the end
of the fifth memory cycle. During the sixth memory
cycle, the bit "A6 = 1" is transferred from the memory to
the memory buffer M0 during the first half of the memory
clock cycle, and the first PE shifts the contents of the
data register YS0 left by one bit position and loads the
vacated bit position with the contents of the memory
buffer M0 during the second half of the memory clock
cycle. Fig. 3 therefore depicts the contents of the
memory buffer M0 as "1", and the respective contents of
the YS0, NEWSO0, NEWSI1, and YS1 registers as
"XXXXXX101001", "1010", "XXXX", "XXXXXXXXXXXX" at the end
of the sixth memory cycle.

During the seventh memory cycle, the bit "A5 = 0" is
transferred from the memory to the memory buffer M0
during the first half of the memory clock cycle, and the
first PE shifts the contents of the data register YS0
left by one bit position and loads the vacated bit
position with the contents of the memory buffer M0 during
the second half of the memory clock cycle. Fig. 3
therefore depicts the contents of the memory buffer M0 as
"0", and the respective contents of the YS0, NEWSO0,
NEWSI1, and YS1 registers as "XXXXX1010010", "1010",
"XXXX", "XXXXXXXXXXXX" at the end of the seventh memory
cycle. During the first half of the eighth memory cycle,
the bit "A4 = 1" is transferred from the memory to the
memory buffer M0, and the data stored in the
communication register NEWSO0 is loaded into the
communication register NEWSI1. During the second half of
the eighth memory cycle, the first PE shifts the contents
of the data register YS0 left by one bit position and
loads the vacated bit position with the contents of the
memory buffer M0. Fig. 3 therefore depicts the contents
of the memory buffer M0 as "1", and the respective
contents of the YS0, NEWSO0, NEWSI1, and YS1 registers as
"XXXX10100101", "1010", "1010", "XXXXXXXXXXXX" at the end
of the eighth memory cycle. During the first half of the
ninth memory cycle, the bit "A3 = 0" is transferred from
the memory to the memory buffer M0, the data stored in
the four (4) LSB positions of the data register YS0 are
loaded into the communication register NEWSO0, and the
second PE loads the least significant four (4) bits of
the data register YS1 with the contents of the
communication register NEWSI1. During the second half of
the ninth memory cycle, the second PE shifts the contents
of the data register YS1 left by one bit position, and
the first PE shifts the contents of the data register YS0
left by one bit position and loads the vacated bit
position with the contents of the memory buffer M0. Fig.
3 therefore depicts the contents of the memory buffer M0
as "0", and the respective contents of the YS0, NEWSO0,
NEWSI1, and YS1 registers as "XXX101001010", "0101",
"1010", "XXXXXXX1010X" at the end of the ninth memory
cycle.

During the first half of the tenth memory cycle, the
bit "A2 = 0" is transferred from the memory to the memory
buffer M0. During the second half of the tenth memory
cycle, the second PE shifts the contents of the data
register YS1 left by one bit position, and the first PE
shifts the contents of the data register YS0 left by one
bit position and loads the vacated bit position with the
contents of the memory buffer M0. Fig. 3 therefore
depicts the contents of the memory buffer M0 as "0", and
the respective contents of the YS0, NEWSO0, NEWSI1, and
YS1 registers as "XX1010010100", "0101", "1010",
"XXXXXX1010XX" at the end of the tenth memory cycle.
During the first half of the eleventh memory cycle, the
bit "A1 = 1" is transferred from the memory to the memory
buffer M0. During the second half of the eleventh memory
cycle, the second PE shifts the contents of the data
register YS1 left by one bit position, and the first PE
shifts the contents of the data register YS0 left by one
bit position and loads the vacated bit position with the
contents of the memory buffer M0. Fig. 3 therefore
depicts the contents of the memory buffer M0 as "1", and
the respective contents of the YS0, NEWSO0, NEWSI1, and
YS1 registers as "X10100101001", "0101", "1010",
"XXXXX1010XXX" at the end of the eleventh memory cycle.
During the first half of the twelfth memory cycle, the
bit "A0 = 1" is transferred from the memory to the memory
buffer M0, and the data stored in the communication
register NEWSO0 is loaded into the communication register
NEWSI1. During the second half of the twelfth memory
cycle, the second PE shifts the contents of the data
register YS1 left by one bit position, and the first PE
shifts the contents of the data register YS0 left by one
bit position and loads the vacated bit position with the
contents of the memory buffer M0. Fig. 3 therefore
depicts the contents of the memory buffer M0 as "1", and
the respective contents of the YS0, NEWSO0, NEWSI1, and
YS1 registers as "101001010011", "0101", "0101",
"XXXX1010XXXX" at the end of the twelfth memory cycle.
At this time, it is noted that the YS0 register of the
first PE contains the 12-bit operand "A".

During the first half of the thirteenth memory
cycle, the data stored in the four (4) LSB positions of
the data register YS0 are loaded into the communication
register NEWSO0, and the second PE loads the least
significant four (4) bits of the data register YS1 with
the contents of the communication register NEWSI1.
During the second half of the thirteenth memory cycle,
the second PE shifts the contents of the data register
YS1 left by one bit position. It is noted that because
every bit of the operand "A" has been loaded into the
data register YS0 during the first through twelfth memory
cycles, the steps of shifting and loading the data
register YS0 no longer need to be performed during the
second half of the memory cycle. Fig. 3 therefore
depicts the contents of the memory buffer M0 as "X", and
the respective contents of the YS0, NEWSO0, NEWSI1, and
YS1 registers as "101001010011", "0011", "0101",
"XXX10100101X" at the end of the thirteenth memory cycle.
During the fourteenth memory cycle, the second PE shifts
the contents of the data register YS1 left by one bit
position. Fig. 3 therefore depicts the contents of the
memory buffer M0 as "X", and the respective contents of
the YS0, NEWSO0, NEWSI1, and YS1 registers as
"XXXXXXXXXXXX", "0011", "0101", "XX10100101XX" at the end
of the fourteenth memory cycle. During the fifteenth
memory cycle, the second PE shifts the contents of the
data register YS1 left by one bit position. Fig. 3
therefore depicts the contents of the memory buffer M0 as
"X", and the respective contents of the YS0, NEWSO0,
NEWSI1, and YS1 registers as "XXXXXXXXXXXX", "0011",
"0101", "X10100101XXX" at the end of the fifteenth memory
cycle.

During the sixteenth memory cycle, the data stored
in the communication register NEWSO0 is loaded into the
communication register NEWSI1, and the second PE shifts
the contents of the data register YS1 left by one bit
position. Fig. 3 therefore depicts the contents of the
memory buffer M0 as "X", and the respective contents of
the YS0, NEWSO0, NEWSI1, and YS1 registers as
"XXXXXXXXXXXX", "0011", "0011", and "10100101XXXX" at the
end of the sixteenth memory cycle. Finally, during the
seventeenth memory cycle, the data stored in the
communication register NEWSI1 is loaded into the four (4)
LSB positions of the data register YS1, thereby
completing the transfer of the 12-bit operand "A" from
the first PE to the second PE. Fig. 3 therefore depicts
the contents of the memory buffer M0 as "X", and the
respective contents of the YS0, NEWSO0, NEWSI1, and YS1
registers as "XXXXXXXXXXXX", "0011", "0011",
"101001010011" at the end of the seventeenth memory
cycle. At this time, it is noted that the YS1 register
of the second PE contains the 12-bit operand "A".
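
The cycle-by-cycle walk-through above can be replayed with a
short Python simulation. This is a sketch under stated
assumptions rather than a description of the actual
circuitry: the function name simulate_transfer and the chunk
counters are hypothetical, registers are modeled as strings
(MSB first, "X" for don't care), the operand length is
assumed to be a multiple of the communication register
width, and the transfer schedule is inferred from the
example itself. Unlike Fig. 3, the sketch leaves YS0
unchanged after the twelfth cycle instead of marking it
don't care.

```python
def simulate_transfer(operand="101001010011", comm_width=4):
    """Replay the example; return one state tuple per memory cycle.

    Each tuple is (cycle, M0, YS0, NEWSO0, NEWSI1, YS1).
    Assumes len(operand) is a multiple of comm_width.
    """
    n = len(operand)
    chunks = n // comm_width
    ys0, ys1 = "X" * n, "X" * n
    newso0, newsi1 = "X" * comm_width, "X" * comm_width
    sent = received = loaded = 0   # chunks through NEWSO0, NEWSI1, YS1
    trace = []
    cycle = 0
    while loaded < chunks:
        cycle += 1
        # First half of the memory cycle: memory -> M0, register transfers.
        m0 = operand[cycle - 1] if cycle <= n else "X"
        if cycle % comm_width == 1 and cycle > 1:
            # Cycles 5, 9, 13, 17, ...: both transfers use the old values.
            if received > loaded:
                ys1 = ys1[:-comm_width] + newsi1   # NEWSI1 -> YS1 LSBs
                loaded += 1
            if sent < chunks:
                newso0 = ys0[-comm_width:]         # YS0 LSBs -> NEWSO0
                sent += 1
        elif cycle % comm_width == 0 and sent > received:
            # Cycles 8, 12, 16, ...: parallel PE-to-PE transfer.
            newsi1 = newso0
            received += 1
        # Second half of the memory cycle: shifts.
        if cycle <= n:
            ys0 = ys0[1:] + m0                     # shift left, load M0
        if 0 < loaded < chunks:
            ys1 = ys1[1:] + "X"                    # make room for next chunk
        trace.append((cycle, m0, ys0, newso0, newsi1, ys1))
    return trace
```

Calling simulate_transfer() and printing each tuple gives one
row per memory cycle in the same column order as the text
(M0, YS0, NEWSO0, NEWSI1, YS1). The length of the returned
list is the number of memory cycles the transfer takes.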

10 In this illustrative example^ it should be 

understood that the step of loading in parallel the data 
stored in the communication register NEWSOO of the first 
PE into the communication register NEWSIl of the second 
PE is performed concurrently with the process of serially 

15 reading the operand '^A" from the memory during the eighth 

and twelfth memory cycles, as described above. It should 
also be understood that the communication register NEWSOO 
of the first PE holds its data for nearly eight (8) PE 
clock cycles during the fifth through eighth memory 

20 cycles, the ninth through twelfth memory cycles, and the 

thirteenth through sixteenth memory cycles to allow the 
data to settle before loading it into the communication 
register NEWSIl of the second PE . It is also noted that 
the transfer of the 12-bit operand ^'A" from the first to 

2 5 the second PE is completed in sixteen (17) memory cycles. 

In general, the SIMD array processor 100 transfers an n- 
bit operand between nearest neighboring PE' s in (n + 5) 
memory cycles* Such data transfers therefore have a 
latency of five (5) memory cycles, as represented by, 

30 e-g-f the first through fourth^ and seventeenth memory 

cycles of the above-described illustrative example. 
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Having described the above embodiment, other 
alternative embodiments or variations may be made* 
Specifically, it was described that each of the PE' s 104 
through 134 of the SIMD array processor 100 (see Fig. 1) 
5 is connected to one and only one of the I/O's 0 through 

15 of the RAM 102 to provide 1-bit serial access to data 
contained therein, and that the data register YS 204 of 
the PE 200 (see Fig, 2) is preferably 64-bits wide. 
However, it should be understood that the SIMD array
processor may alternatively provide n-bit serial access
to data stored in memory, and the YS register of the PE
may alternatively be m-bits wide, wherein "m" is
preferably greater than "n". The memory buffer M of the
PE may therefore be n-bits wide to provide the n-bit
serial data access. The SIMD array processor of this
alternative embodiment provides enhanced data transfer
efficiency by loading in parallel data stored in the
communication register NEWSOO of a PE into the
communication register NEWSIl of a neighboring PE while
serially reading at least one (1) bit of an operand from
memory.
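The n-bit serial access of this alternative embodiment may be sketched, under stated assumptions, as follows. The function `serial_read`, its parameters, and the least-significant-group-first ordering are illustrative assumptions and are not details taken from the embodiment; the sketch models only the memory read through an n-bit buffer into a wider register, not the concurrent parallel transfer between communication registers.

```python
def serial_read(operand: int, operand_bits: int, n: int):
    """Yield (memory_cycle, partial_register) while an operand is read
    n bits per memory cycle into a wider data register (hypothetical model)."""
    if n < 1 or operand_bits < 1:
        raise ValueError("widths must be positive")
    mask = (1 << n) - 1
    register = 0
    cycle = 0
    for shift in range(0, operand_bits, n):
        cycle += 1
        group = (operand >> shift) & mask  # n-bit group passing through buffer M
        register |= group << shift         # accumulated in the wider YS register
        yield cycle, register

# Assumed example: a 12-bit operand read through a 4-bit memory buffer.
steps = list(serial_read(0xABC, 12, 4))
print(len(steps), hex(steps[-1][1]))
```

With n equal to one (1), the same sketch reduces to the 1-bit serial access of the first embodiment, taking one memory cycle per operand bit.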

Those of ordinary skill in the art should further
appreciate that variations to and modification of the
above-described SIMD array processor may be made without
departing from the inventive concepts disclosed herein.
Accordingly, the present invention should be viewed as
limited solely by the scope and spirit of the appended
claims.

CLAIMS

What is claimed is: 



1. A single-instruction multiple-data array processor,
comprising:

at least one memory; and

a plurality of mesh-connected processing elements,
each processing element including at least one memory
buffer, at least one data register, at least one first
communication register, and at least one second
communication register, the memory buffer being adapted
to transfer data between the memory and the data
register, the first communication register being adapted
to transfer data between the data register and the second
communication register of a neighboring processing
element,

wherein each of the memory buffer, the data
register, and the first and second communication
registers has a width measured in bits, the width of the
data register being greater than the width of the memory
buffer and a multiple of the width of the memory buffer,
and the data register being at least as wide as the first
and second communication registers.

2. The single-instruction multiple-data array processor
of claim 1 wherein the first communication register is
adapted to transfer data to the second communication
register of the neighboring processing element while the
memory buffer transfers data between the memory and the
data register.

3. The single-instruction multiple-data array processor
of claim 1 wherein each processing element is adapted to
perform data processing operations using at least a
portion of the data registers while the memory buffer
transfers data between the memory and at least one of the
data registers.

4. The single-instruction multiple-data array processor
of claim 1 wherein the memory buffer is further adapted
to transfer data serially between the memory and the data
register.

5. The single-instruction multiple-data array processor
of claim 1 wherein the first communication register is
further adapted to transfer data in parallel between the
data register and the second communication register of
the neighboring processing element.

6. The single-instruction multiple-data array processor
of claim 1 wherein the memory includes a plurality of
bit-serial columns of memory locations, and the memory
buffer is further adapted to transfer data serially
between the data register and a selected one of the bit-
serial columns of memory.

7. The single-instruction multiple-data array processor
of claim 1 wherein the memory is a synchronous dynamic
random access memory.

8. The single-instruction multiple-data array processor
of claim 1 wherein the plurality of mesh-connected
processing elements comprises a NEWS array.

9. A method of operating a single-instruction multiple-data
array processor, the array processor including at least
one memory and a plurality of mesh-connected processing 
elements, each processing element including a memory 
buffer, a data register, a first communication register,
and a second communication register, the method
comprising the steps of:

transferring data between the memory and a first
data register by a first memory buffer, the first data
register and the first memory buffer being included in a
first processing element; and

transferring the data between the first data
register and a second communication register by a first
communication register, the first communication register
being included in the first processing element and the
second communication register being included in a second
processing element,

wherein a first portion of the data is transferred
in the second transferring step while a second portion of
the data is being transferred in the first transferring
step.



10. The method of claim 9 wherein the data are
transferred in parallel between the first data register
and the first communication register in the second
transferring step.

11. The method of claim 9 wherein the data are
transferred between the memory and the first data
register in the first transferring step in a bit-serial
manner.

12. The method of claim 9 wherein the data comprise an
n-bit data word and the data is transferred from the
memory to the second communication register in the first
and second transferring steps in (n + 5) operating cycles
of the memory.